-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-4465] Optimizing file-listing sequence of Metadata Table #6016
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
bf5f6ae to
ecf4114
Compare
|
hey @alexeykudinkin : can you link the right jira for the patch. |
ecf4114 to
0ae8d0e
Compare
0ae8d0e to
6194091
Compare
40e7639 to
178c9f7
Compare
| .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BUCKET).withBucketNum("1").build()) | ||
| .build(); | ||
|
|
||
| Properties props = getPropertiesForKeyGen(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pass HoodieTableConfig.POPULATE_META_FIELDS.defaultValue() instead of hard-coding true?
| // Once meta fields are disabled, it cant be re-enabled for a given table. | ||
| if (!getTableConfig().populateMetaFields() | ||
| && Boolean.parseBoolean((String) properties.getOrDefault(HoodieTableConfig.POPULATE_META_FIELDS.key(), HoodieTableConfig.POPULATE_META_FIELDS.defaultValue()))) { | ||
| && Boolean.parseBoolean((String) properties.getOrDefault(HoodieTableConfig.POPULATE_META_FIELDS.key(), HoodieTableConfig.POPULATE_META_FIELDS.defaultValue().toString()))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it necessary? it's already being type cast to String
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not
| } | ||
|
|
||
| /** | ||
| * TODO elaborate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo? javadoc only right?
| FileSystem fs = FSUtils.getFs(pathForReader.toString(), new Configuration()); | ||
| // Read the content | ||
| HoodieHFileReader<IndexedRecord> reader = new HoodieHFileReader<>(fs, pathForReader, content, Option.of(writerSchema)); | ||
| HoodieHFileReader<IndexedRecord> reader = new HoodieHFileReader<>(null, pathForReader, content, Option.of(writerSchema)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could affect HFile reading. I believe there is some validation in HFile system or HFile's reader context for fs to be non-null. I think we should still pass fs and still keep this line in HoodieHFileUtils#createHFileReader:
Configuration conf = new Configuration(false);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, i checked it and it actually doesn't use fs at all
| private List<String> getAllPartitionPathsUnchecked() { | ||
| try { | ||
| if (partitionColumns.length == 0) { | ||
| return Collections.singletonList(""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be Collections.emptyList()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-partitioned table has exactly one partition, which we designate w/ ""
| public Map<String, FileStatus[]> getAllFilesInPartitions(List<String> partitions) | ||
| throws IOException { | ||
| if (partitions.isEmpty()) { | ||
| return Collections.emptyMap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If BaseHoodieTableFileIndex#getAllPartitionPathsUnchecked returns Collections.singletonList("") then should we add an entry for "" in the map, or rather make getAllPartitionPathsUnchecked return Collections.emptyList()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, this is somewhat dissonant, but that's just the way things are -- for non-partitioned tables it's assumed that the only partition that is there has to be identified by ""
|
|
||
| String keyGen = properties.getProperty("hoodie.datasource.write.keygenerator.class"); | ||
| if (!Objects.equals(keyGen, "org.apache.hudi.keygen.NonpartitionedKeyGenerator")) { | ||
| builder.setPartitionFields("some_nonexistent_field"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract to constant to standardize across tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should actually standardize on this one, it's just to stop the bleeding in misconfigured tests
|
|
||
| private def shouldValidatePartitionColumns(spark: SparkSession): Boolean = { | ||
| // NOTE: We can't use helper, method nor the config-entry to stay compatible w/ Spark 2.4 | ||
| spark.sessionState.conf.getConfString("spark.sql.sources.validatePartitionColumns", "true").toBoolean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this go into Spark2Adapter or Spark2ParsePartitionUtil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to
b038d97 to
705660e
Compare
codope
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Can you please rebase? We can land once the CI is green.
… reflection in the hot-path
…cing it w/ - `CachingPath` object - Invoking more performnat unsafe ctors/utils
Use `CachingPath` in `SparkHoodieTableFileIndex`; Tidying up;
Refactored `SerializablePath` to serialize `URI` instead (to avoid parsing, when deser); Tidying up
705660e to
95ce817
Compare
| // Make sure key-generator is configured properly | ||
| ValidationUtils.checkArgument(recordKeyField == null || !recordKeyField.isEmpty(), | ||
| "Record key field has to be non-empty!"); | ||
| ValidationUtils.checkArgument(partitionPathField == null || !partitionPathField.isEmpty(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the validation message be more user-friendly? Let's say
"Partition path field has to be non-empty! For non-partitioned table, set key generator class to NonPartitionedKeyGenerator".
Also, why are these validations only added for SimpleKeyGenerator? Why not other keygens as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it makes sense to put suggestions into exception messages -- exception messages should be focused on the problem triggering it, rather than on potential to remedy it (empty partition-path field is usually a sign of misconfiguration, since there's no default value, meaning that user passes "" explicitly)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough.
Should we add these validations to other keygens as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline. It will be taken up separately. @alexeykudinkin in case if you have a JIRA, please link it here. For simple keygen we need the validation because of misconfiguration of some tests that were passing “” as partition fields.
...rk-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestTimeTravelQuery.scala
Outdated
Show resolved
Hide resolved
be9c519 to
46e53b5
Compare
| List<String> matchedPartitionPaths = FSUtils.getAllPartitionPaths(engineContext, metadataConfig, basePath) | ||
| List<String> matchedPartitionPaths = getAllPartitionPathsUnchecked() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change affects partitioned tables that meet both of these conditions 1) hoodie.table.partition.fields not present in table config, and 2) metadata disabled. getAllPartitionPathsUnchecked() treats them as non-partitioned table, and resulted in not loading any records.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The core of the problem is that hoodie.table.partition.fields has to be properly configured -- the table would be considered non-partitioned by some parts of the code (outside of this one) so we need to make sure this is set properly.
Tips
What is the purpose of the pull request
This PR optimizes file-listing sequence of the Metadata Table to make sure it's on par or better than FS-based file-listing
Change log:
Pathnew Pathw/createUnsafePathwhere possibleTimestampFormatter,DateFormatterfor timezoneBaseTableMetadatatwice w/inBaseHoodieTableFileIndexBrief change log
See above
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.